style space
Exploring speech style spaces with language models: Emotional TTS without emotion labels
Chandra, Shreeram Suresh, Du, Zongyang, Sisman, Berrak
Many frameworks for emotional text-to-speech (E-TTS) rely on human-annotated emotion labels that are often inaccurate and difficult to obtain. Learning emotional prosody implicitly presents a tough challenge due to the subjective nature of emotions. In this study, we propose a novel approach that leverages text awareness to acquire emotional styles without the need for explicit emotion labels or text prompts. We present TEMOTTS, a two-stage framework for E-TTS that is trained without emotion labels and is capable of inference without auxiliary inputs. Our proposed method performs knowledge transfer between the linguistic space learned by BERT and the emotional style space constructed by global style tokens. Our experimental results demonstrate the effectiveness of our proposed framework, showcasing improvements in emotional accuracy and naturalness. This is one of the first studies to leverage the emotional correlation between spoken content and expressive delivery for emotional TTS.
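As a rough illustration of the text-aware style lookup described above, the sketch below queries a bank of learnable global style tokens with a BERT sentence embedding; the token count, dimensions, and attention form are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TextAwareStyleSpace(nn.Module):
    """Attend over a bank of global style tokens using a BERT sentence
    embedding as the query, yielding a style embedding without emotion
    labels. All sizes are illustrative, not TEMOTTS's actual settings."""

    def __init__(self, bert_dim=768, num_tokens=10, style_dim=256):
        super().__init__()
        self.style_tokens = nn.Parameter(torch.randn(num_tokens, style_dim))
        self.query_proj = nn.Linear(bert_dim, style_dim)

    def forward(self, bert_cls_embedding):           # (batch, bert_dim)
        query = self.query_proj(bert_cls_embedding)  # (batch, style_dim)
        # Scaled dot-product attention weights over the style tokens.
        scores = query @ self.style_tokens.t() / self.style_tokens.size(1) ** 0.5
        weights = torch.softmax(scores, dim=-1)      # (batch, num_tokens)
        return weights @ self.style_tokens           # (batch, style_dim)
```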
Enhancing Industrial Transfer Learning with Style Filter: Cost Reduction and Defect-Focus
Li, Chen, Ma, Ruijie, Qian, Xiang, Wang, Xiaohao, Li, Xinghui
Addressing the challenge of data scarcity in industrial domains, transfer learning emerges as a pivotal paradigm. This work introduces Style Filter, a methodology tailored to industrial contexts. By selectively filtering source-domain data before knowledge transfer, Style Filter reduces the quantity of data while maintaining or even enhancing the performance of the transfer learning strategy. It operates without labels, relies minimally on prior knowledge, is independent of specific models, and can be reused. Evaluated on authentic industrial datasets, Style Filter proves effective when applied before conventional deep-learning transfer strategies, underscoring its value in real-world industrial applications.
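The abstract does not spell out the filtering criterion, so the sketch below shows one label-free, model-agnostic reading: source samples are kept only if a simple style signature (channel-wise feature mean and standard deviation, a common style proxy) lies close to the target domain's centroid. The function names and keep ratio are hypothetical.

```python
import numpy as np

def style_signature(features):
    """Channel-wise mean and std of a (C, H, W) feature map: a common
    label-free 'style' proxy. Style Filter's actual criterion may differ."""
    c = features.shape[0]
    flat = features.reshape(c, -1)
    return np.concatenate([flat.mean(axis=1), flat.std(axis=1)])

def style_filter(source_feats, target_feats, keep_ratio=0.5):
    """Keep the fraction of source samples whose style signature is
    closest to the target-domain centroid; return their indices."""
    centroid = np.mean([style_signature(f) for f in target_feats], axis=0)
    dists = np.array([np.linalg.norm(style_signature(f) - centroid)
                      for f in source_feats])
    k = max(1, int(keep_ratio * len(source_feats)))
    return np.argsort(dists)[:k]  # indices of retained source samples
```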
Face Identity-Aware Disentanglement in StyleGAN
Suwała, Adrian, Wójcik, Bartosz, Proszewska, Magdalena, Tabor, Jacek, Spurek, Przemysław, Śmieja, Marek
Conditional GANs are frequently used for manipulating the attributes of face images, such as expression, hairstyle, pose, or age. Even though the state-of-the-art models successfully modify the requested attributes, they simultaneously modify other important characteristics of the image, such as a person's identity. In this paper, we focus on solving this problem by introducing PluGeN4Faces, a plugin to StyleGAN, which explicitly disentangles face attributes from a person's identity. Our key idea is to perform training on images retrieved from movie frames, where a given person appears in various poses and with different attributes. By applying a type of contrastive loss, we encourage the model to group images of the same person in similar regions of latent space. Our experiments demonstrate that the modifications of face attributes performed by PluGeN4Faces are significantly less invasive on the remaining characteristics of the image than in the existing state-of-the-art models.
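A minimal sketch of the kind of contrastive objective the abstract mentions, assuming latent codes and per-image identity labels; the pairwise form, distance, and margin are assumptions rather than PluGeN4Faces's exact loss.

```python
import torch
import torch.nn.functional as F

def identity_contrastive_loss(latents, person_ids, margin=1.0):
    """Pull latent codes of the same person together and push different
    identities beyond a margin. Assumes the batch contains at least one
    same-person pair; details are illustrative."""
    dists = torch.cdist(latents, latents)                        # (N, N)
    same = person_ids.unsqueeze(0) == person_ids.unsqueeze(1)    # (N, N)
    eye = torch.eye(len(latents), dtype=torch.bool, device=latents.device)
    pos = dists[same & ~eye]     # same identity, excluding self-pairs
    neg = dists[~same]           # different identities
    return pos.pow(2).mean() + F.relu(margin - neg).pow(2).mean()
```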
Make It So: Steering StyleGAN for Any Image Inversion and Editing
Bhattad, Anand, Shah, Viraj, Hoiem, Derek, Forsyth, D. A.
StyleGAN's disentangled style representation enables powerful image editing by manipulating the latent variables, but accurately mapping real-world images to their latent variables (GAN inversion) remains a challenge. Existing GAN inversion methods struggle to maintain editing directions and produce realistic results. To address these limitations, we propose Make It So, a novel GAN inversion method that operates in the $\mathcal{Z}$ (noise) space rather than the typical $\mathcal{W}$ (latent style) space. Make It So preserves editing capabilities even for out-of-domain images, a crucial property overlooked in prior methods. Our quantitative evaluations demonstrate that Make It So outperforms the state-of-the-art method PTI~\cite{roich2021pivotal} by a factor of five in inversion accuracy and achieves ten times better edit quality for complex indoor scenes.
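For orientation, optimization-based inversion in the $\mathcal{Z}$ (noise) space can be sketched as below; the generator interface (a `z_dim` attribute), loss, and optimizer settings are placeholders, not the paper's implementation.

```python
import torch

def invert_in_z(generator, target_image, steps=500, lr=0.01):
    """Optimize a noise-space latent z so that G(z) reconstructs the target.
    `generator` is any differentiable G: z -> image with a `z_dim` attribute
    (an assumed interface); the loss and step count are illustrative."""
    z = torch.randn(1, generator.z_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        recon = generator(z)
        loss = torch.nn.functional.mse_loss(recon, target_image)
        loss.backward()
        opt.step()
    return z.detach()  # per the paper, Z-space codes keep edits usable
```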
Matched sample selection with GANs for mitigating attribute confounding
Singh, Chandan, Balakrishnan, Guha, Perona, Pietro
Measuring biases of vision systems with respect to protected attributes like gender and age is critical as these systems gain widespread use in society. However, significant correlations between attributes in benchmark datasets make it difficult to separate algorithmic bias from dataset bias. To mitigate such attribute confounding during bias analysis, we propose a matching approach that selects a subset of images from the full dataset with balanced attribute distributions across protected attributes. Our matching approach first projects real images onto a generative adversarial network (GAN)'s latent space in a manner that preserves semantic attributes. It then finds image matches in this latent space across a chosen protected attribute, yielding a dataset where semantic and perceptual attributes are balanced across the protected attribute. We validate projection and matching strategies with qualitative, quantitative, and human annotation experiments. We demonstrate our work in the context of gender bias in multiple open-source facial-recognition classifiers and find that bias persists after removing key confounders via matching. Code and documentation to reproduce the results here and apply the methods to new data are available at https://github.com/csinva/matching-with-gans.
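A minimal sketch of the matching step under stated assumptions: images have already been projected to GAN latent codes, and each latent from one protected-attribute group is greedily paired with its nearest unused neighbor in the other group. The paper's actual matching procedure may differ.

```python
import numpy as np

def match_across_attribute(latents_a, latents_b):
    """Greedily pair each latent in group A with its nearest unused latent
    in group B, yielding index pairs balanced across the protected
    attribute. Assumes len(latents_b) >= len(latents_a)."""
    used = set()
    pairs = []
    for i, za in enumerate(latents_a):
        dists = np.linalg.norm(latents_b - za, axis=1)
        j = next(k for k in np.argsort(dists) if k not in used)
        used.add(j)
        pairs.append((i, int(j)))
    return pairs  # matched images form the balanced analysis dataset
```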
Automated outfit generation with deep learning
We developed a machine learning model which is capable of completing an outfit based on a given seed product. Here we give an overview of our model and some of the challenges we faced. We consider an outfit to be a set of fashion items which match stylistically and can be worn together. In order for the outfit to work, each item must be compatible with all other items. Our aim is to create a model which embeds each item in a latent style space such that for any two items the dot product (a measure of similarity) of their embeddings reflects their compatibility.
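As noted above, compatibility is read off as a dot product in the style space; a minimal sketch of scoring and seed-based outfit completion under that assumption follows (names and dimensions are illustrative).

```python
import numpy as np

def compatibility(u, v):
    """Dot product of two style-space embeddings as a compatibility score."""
    return float(np.dot(u, v))

def complete_outfit(seed_emb, candidate_embs, k=4):
    """Rank candidates by compatibility with the seed item and return the
    top-k indices. A full system would also check that every chosen item
    is pairwise compatible with all the others, as the text requires."""
    scores = candidate_embs @ seed_emb
    return np.argsort(scores)[::-1][:k]
```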
DrumNet
Sony CSL Paris develops technology for AI-assisted music production. The goal is not to replace musicians, but to provide them with better tools to be more efficient in realizing their creative ideas. DrumNet is based on an artificial neural network that learns rhythmic relationships between different instruments and encodes these relationships in a 16-dimensional style space. A similar example is the Logic Pro X Drummer, which lets the user specify the playing style by navigating a two-dimensional space. DrumNet differs from the Logic Pro X Drummer, however, in that it dynamically adapts to the existing music.
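A hypothetical sketch, not Sony CSL's code, of what conditioning a drum-pattern generator on a point in a 16-dimensional style space might look like: moving the style vector changes the playing style while the output still depends on the input track.

```python
import torch
import torch.nn as nn

class StyleConditionedDrums(nn.Module):
    """Hypothetical sketch: map features of the existing track plus a
    16-dim style vector to a drum pattern, so the output both follows
    the style point and adapts to the music. Sizes are illustrative."""

    def __init__(self, track_dim=64, style_dim=16, steps=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(track_dim + style_dim, 128), nn.ReLU(),
            nn.Linear(128, steps))

    def forward(self, track_features, style):
        x = torch.cat([track_features, style], dim=-1)
        return torch.sigmoid(self.net(x))  # per-step hit probabilities
```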
Fashion Outfit Generation for E-commerce
Bettaney, Elaine M., Hardwick, Stephen R., Zisimopoulos, Odysseas, Chamberlain, Benjamin Paul
Combining items of clothing into an outfit is a major task in fashion retail. Recommending sets of items that are compatible with a particular seed item is useful for providing users with guidance and inspiration, but is currently a manual process that requires expert stylists and is therefore not scalable or easy to personalise. We use a multilayer neural network fed by visual and textual features to learn embeddings of items in a latent style space such that compatible items of different types are embedded close to one another. We train our model using the ASOS outfits dataset, which consists of a large number of outfits created by professional stylists and which we release to the research community. Our model shows strong performance in an offline outfit compatibility prediction task. We use our model to generate outfits and for the first time in this field perform an A/B test, comparing our generated outfits to those produced by a baseline model which matches appropriate product types but uses no information on style. Users approved of outfits generated by our model 21% and 34% more frequently than those generated by the baseline model for womenswear and menswear respectively.
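A minimal sketch of the embedding network described above, assuming pre-extracted visual and textual features per item; layer sizes and normalization are illustrative rather than the ASOS model's actual architecture.

```python
import torch
import torch.nn as nn

class OutfitEmbedder(nn.Module):
    """Fuse visual and textual item features into a latent style space
    where compatible items should land close together. Feature sizes
    are assumptions, not the paper's configuration."""

    def __init__(self, visual_dim=512, text_dim=300, style_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(visual_dim + text_dim, 256), nn.ReLU(),
            nn.Linear(256, style_dim))

    def forward(self, visual, text):
        emb = self.mlp(torch.cat([visual, text], dim=-1))
        # Unit-normalize so proximity in the style space is comparable.
        return nn.functional.normalize(emb, dim=-1)
```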
Anime Style Space Exploration Using Metric Learning and Generative Adversarial Networks
Deep learning-based style transfer between images has recently become a popular area of research. A common way of encoding "style" is through a feature representation based on the Gram matrix of features extracted by some pre-trained neural network, or some other form of feature statistics. Such a definition rests on an arbitrary human choice and may not best capture what a style really is. In trying to gain a better understanding of "style", we propose a metric learning-based method to explicitly encode the style of an artwork. In particular, our definition of style captures the differences between artists, as shown by classification performance, and yields a style representation that can be interpreted, manipulated, and visualized through style-conditioned image generation with a Generative Adversarial Network. We employ this method to explore the style space of anime portrait illustrations.
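For contrast with the learned representation the paper proposes, the hand-chosen Gram-matrix style statistic it argues against can be computed in a few lines:

```python
import torch

def gram_matrix(features):
    """Gram matrix of a (C, H, W) feature map: channel-by-channel
    correlations, the classic fixed 'style' statistic. Normalization
    conventions vary; dividing by C*H*W is one common choice."""
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    return flat @ flat.t() / (c * h * w)
```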
Deep Style Match for Complementary Recommendation
Zhao, Kui (Zhejiang University) | Hu, Xia (Hangzhou Science & Technology Information Research Institute) | Bu, Jiajun (Zhejiang University) | Wang, Can (Zhejiang University)
Humans develop a common sense of style compatibility between items based on their attributes. We seek to automatically answer questions like "Does this shirt go well with that pair of jeans?" To answer these kinds of questions, we attempt to model the human sense of style compatibility in this paper. The basic assumption of our approach is that most of the important attributes of a product in an online store are included in its title description, so it is feasible to learn style compatibility from these descriptions. We design a Siamese Convolutional Neural Network architecture and feed it with title pairs of items that are either compatible or incompatible. These pairs are mapped from the original space of symbolic words into an embedded style space. Our approach takes only words as input, requires little preprocessing, and involves no laborious and expensive feature engineering.
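A minimal sketch of the Siamese setup under stated assumptions: one shared encoder maps each title (as word ids) into the embedded style space, and a pair of titles is scored by embedding similarity. Vocabulary size, convolution, and pooling details are illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SiameseTitleEncoder(nn.Module):
    """Shared encoder mapping a product title (word ids) into a style
    space; a title pair is scored by the similarity of its embeddings.
    All sizes and layer choices are illustrative assumptions."""

    def __init__(self, vocab_size=20000, emb_dim=128, style_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Conv1d(emb_dim, style_dim, kernel_size=3, padding=1)

    def encode(self, word_ids):                       # (batch, seq_len)
        x = self.embed(word_ids).transpose(1, 2)      # (batch, emb_dim, seq_len)
        return torch.relu(self.conv(x)).max(dim=2).values  # pool over words

    def forward(self, title_a, title_b):
        ea, eb = self.encode(title_a), self.encode(title_b)
        return torch.cosine_similarity(ea, eb)        # compatibility score
```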